revealing multimodal contrastive representation learning
Revealing Multimodal Contrastive Representation Learning through Latent Partial Causal Models
Liu, Yuhang, Zhang, Zhen, Gong, Dong, Huang, Biwei, Gong, Mingming, Hengel, Anton van den, Zhang, Kun, Shi, Javen Qinfeng
One promising methods have proven successful across a range strategy in this context is to use data from one modality, e.g., of domains, partly due to their ability to generate text data, as a supervision signal in the interpretation of another, meaningful shared representations of complex e.g., image data (Mori et al., 1999; Wang et al., 2009; phenomena. To enhance the depth of analysis Ramanathan et al., 2013; He & Peng, 2017; Radford et al., and understanding of these acquired representations, 2021). The primary approach for achieving this is known we introduce a unified causal model specifically as multimodal contrastive representation learning, which designed for multimodal data. By examining focuses on optimizing a symmetric contrastive loss (Zhang this model, we show that multimodal contrastive et al., 2022; Radford et al., 2021), e.g., a symmetric adaptation representation learning excels at identifying latent of the standard contrastive loss (Wu et al., 2018; Tian coupled variables within the proposed unified et al., 2020; He et al., 2020; Chen et al., 2020). The learned model, up to linear or permutation transformations representations, guided by the symmetric contrastive loss, resulting from different assumptions. Our have been applied in a variety of applications, including findings illuminate the potential of pre-trained zero/few-shot learning (Radford et al., 2021; Zhou et al., multimodal models, e.g., CLIP, in learning disentangled 2022a), domain generalization (Zhou et al., 2022a;b), and representations through a surprisingly robustness to adversarial examples (Ban & Dong, 2022).